Spatial fluctuations of guanine and cytosine base content (GC%) are studied by spectral analysis
for the complete set of human genomic DNA sequences. We find that (i) the $1/f^\alpha$ decay is universally
observed in the power spectra of all twenty-four chromosomes, and that (ii) the exponent $\alpha \approx 1$
extends to about $10^7$ bases, one order of magnitude longer than what has previously been observed.
We further find that (iii) almost all human chromosomes exhibit a cross-over from $\alpha_1 \approx 1$ ($1/f^{\alpha_1}$) at
lower frequency to $\alpha_2 < 1$ ($1/f^{\alpha_2}$) at higher frequency, typically occurring at around 30,000-100,000
bases, while (iv) the cross-over in this frequency range is virtually absent in human chromosome
22. In addition to the universal $1/f^\alpha$ noise in power spectra, we find (v) several lines of evidence
for chromosome-specific correlation structures, including a 500,000 bases long oscillation in human
chromosome 21. The universal $1/f^\alpha$ spectrum in the human genome is further substantiated by a
resistance to variance reduction in guanine and cytosine content when the window size is increased.

PACS numbers: 87.10.+e, 87.14.Gg, 87.15.Cc, 02.50.-r, 02.50.Tt, 89.75.Da, 89.75.Fb, 05.40.-a



I. INTRODUCTION 

By measuring the proportion of a signal's power $S(f)$ falling into a range of frequency
components $f$, a power spectrum of the form $S(f) \sim 1/f^\alpha$ distinguishes between two
prototypes of noise: white noise ($\alpha = 0$) and Brownian noise ($\alpha = 2$). The
intermediate range, termed "$1/f$ noise", can practically be defined as $1/f^\alpha$ with
$0.5 < \alpha < 1.5$. $1/f$ noise was first observed experimentally in electric current
fluctuations of the thermionic tube at the beginning of the twentieth century [?].
Since then, $1/f$ noise has been found repeatedly in many other conducting materials [?].
More generally, it has also been observed in a wide range of natural as well as
human-related phenomena, including traffic flow, starlight, speech, music, and human
coordination [?]. For biological sequences, such as DNA, the concept of slowly varying,
multiple-length variations in the power of frequency components can be translated to
long-ranging correlations in the spatial arrangement of the four bases adenine (A),
cytosine (C), guanine (G), and thymine (T). One can categorize A, C, G, and T chemically
as strong (G or C) or weak (A or T) bonding. It has been shown that fluctuations of the
GC base content along a DNA sequence are typically more strongly correlated than those of
other possible binary classifications [?]. Initial studies of $1/f$ noise in DNA sequences
were motivated by a model of spatial $1/f$ noise in symbolic sequence evolution [?].
Subsequently, empirical $1/f$ spectra were indeed observed in non-protein-coding DNA
sequences [?], and their generality in DNA sequences was further illustrated in [?].

$1/f$ noise has been detected in a variety of different species and taxonomic classes,
including bacteria [?], yeasts [?], insects [?], and other higher eukaryotic genomes.
Integrating this and several other lines of evidence, a consensus on $1/f$ noise in DNA
sequences has emerged: (1) for DNA sequences of the order of $10^6$ bases (1 Mb), a
$1/f^\alpha$ spectrum ($\alpha \approx 1$) is consistently observed; (2) for isochores, which
are DNA sequences of relatively homogeneous base concentration at least $300 \cdot 10^3$
bases (300 kb) long [?], a $1/f^\alpha$ spectrum is also observed, but typically shows a
smaller exponent $\alpha < 0.7$ [?]; (3) for DNA sequences of the order of several kb, the
decay of $S(f)$ is non-trivial and may depend on whether the sequence is protein-coding
[?]. The viral DNA sequence of the $\lambda$-phage, e.g., shows a single step in its GC base
concentration and its spectrum is $S(f) \sim 1/f^2$, which is characteristic of random
block sequences [18]. We note that the universal scaling of $S(f) \sim 1/f^\alpha$
($\alpha \approx 1$) across all species discussed in [?] has apparently been restricted to a
length scale of 1 kb, by averaging the spectrum over many $N = 2$ kb DNA segments.

With the availability of the first completed version of the DNA sequence of the human
genome [13], several studies have been able to demonstrate that the base-base correlation
function $\Gamma(d)$ ($d$: distance between bases) of several DNA sequences follows a
power-law decay, $\Gamma(d) \sim 1/d^\gamma$. For instance, the DNA sequence of human
chromosome 22 shows statistically significant power-law correlations up to $d = 1$ Mb,
and correlations in the DNA sequence of chromosome 21 are statistically significant up to
several Mb (with the scaling exponent $\gamma$ changing beyond a few kb) [20]. Since the
DNA sequences of human chromosomes 21 and 22 are only about 34 Mb long, longer sequences
are necessary to estimate the limit of the range of the $1/f^\alpha$ spectrum.

FIG. 1: Double-logarithmic representation of the human genome-wide length distribution
of interspersed repeat sequences, non-repetitive sequences, and sequences of unknown base
composition (gaps). The length distributions of interspersed repeats and non-repetitive
sequences exhibit a power-law-like decay, while that of gap sequences is scattered across
different sequence lengths. The peaks at ~300 bases and several kb correspond to Alu and
possibly LINE repeats.

FIG. 2: Distribution of genome-wide GC content (GC%) of the human genome for interspersed
repeat sequences, non-repetitive sequences, and all ("overall") sequences, with sequence
segments of 20 kb. The mode (peak location) of non-repetitive sequences is at ~35%, while
the mode of repetitive sequences is shifted to a higher GC% (~42%). The fraction of
non-repetitive sequences with GC% > 50% is markedly larger than that of the repetitive
sequences.

By 2004, about three years after the release of the draft human genome sequence in
February 2001, a dozen (out of 24) human chromosomes had been completed to the standard
of less than one error per 10,000 DNA bases (99.99% accuracy) [21]. Building upon the
release of updated, high-quality sequence data, we can now conduct a systematic analysis
of several issues of $1/f$ noise in the DNA sequences of our own species, Homo sapiens,
which have been pursued over the last decade in a fragmentary manner.

In this paper, we use the DNA sequences of the complete set of twenty-two autosomes and
two sex chromosomes to address the following issues: Is $1/f$ noise universally present
across the entire set of human genome sequences? Does $1/f$ noise extend to lower
frequency ranges in longer DNA sequences? Is the decay of $S(f)$ characterized by a
single exponent $\alpha$, or does it exhibit cross-overs (multiple scaling exponents)?
Given the presence of universal variations at multiple scales, do these co-exist with
variations at chromosome-specific scales?



II. DATA AND METHODS 

In this section, we introduce the data for human genome sequences, as well as the
notation and definitions used throughout this study. Twenty-four chromosomes are
assembled in build 34 of the NCBI (human genome hg16 release). Sequence data were
downloaded from the UCSC human genome repository (available at
http://genome.ucsc.edu/). Unsequenced bases are kept to preserve spacing between bases.
Human chromosomes (Chr) 13, 14, 15, 21, and 22 contain a large number of unsequenced
bases at the left end of their DNA sequences, amounting to about 15%, 17%, 18%, 21%, and
29% of the individual chromosome size, respectively; 51% of chromosome Y is unsequenced.

Our analysis of human DNA sequences is conducted using coarse-grained data. Each original
sequence was transformed into a spatial series of GC content (GC%) values. To this end,
we evenly partition a DNA sequence into $N$ non-overlapping windows of length $w$ bases
and compute $p_i(w) = \mathrm{GC\%}_i$ for each window $i$, to obtain a spatial GC% series:

$$\{p_i\} = \{p_i(w)\} = \{\mathrm{GC\%}_i\}, \qquad i = 1, 2, \ldots, N \qquad (1)$$

Table 1 lists the corresponding window sizes for each human chromosome. Since different
human chromosomes have different sizes, whereas the number of partitions ($N$) is the
same, the window lengths vary.
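
The coarse-graining of Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name `gc_series` is hypothetical, unsequenced bases (`N`) are assumed to be skipped when a window's percentage is computed, and a window without any sequenced base is assumed to yield NaN.

```python
def gc_series(seq, n_windows):
    """Split seq into n_windows equal, non-overlapping windows; return GC% per window."""
    w = len(seq) // n_windows              # window length in bases (a remainder is dropped)
    series = []
    for i in range(n_windows):
        # Upper-casing means soft-masked (lowercase) repeat bases are counted as well.
        window = seq[i * w:(i + 1) * w].upper()
        gc = window.count("G") + window.count("C")
        at = window.count("A") + window.count("T")
        # Assumption: unsequenced bases ('N') are ignored inside a window;
        # a window with no sequenced base gives NaN.
        series.append(100.0 * gc / (gc + at) if gc + at else float("nan"))
    return series

# Toy input, not genomic data: three windows of six bases each.
series = gc_series("GGCCATATGCatNNNNNN", 3)
```

With $N = 2^{17}$ windows, `w` would correspond to the chromosome-specific window sizes listed in Table 1.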

Human DNA sequences contain a large fraction of interspersed repeats, i.e., copies of an
ancestral sequence fragment that remain highly similar to the ancestral sequence. One can
detect interspersed repeats by using the program RepeatMasker [23]. "Soft-masked"
annotations of interspersed repeats are taken from the DNA sequences of the UCSC human
genome repository (http://genome.ucsc.edu/), where repetitive (non-repetitive) bases are
annotated in small (capital) letters. Figure 1 shows the length distribution of the three
sequence classes of uninterrupted non-repetitive, interspersed repeat, and gap sequences.
Figure 2 shows the corresponding distribution of the genome-wide GC% for these three
sequence classes.

TABLE I: Average GC content (GC% or $p$) and the window size ($w$) for partitions using
$N = 2^{17}$ non-overlapping windows for twenty-four human chromosomes. Low-frequency
scaling exponents $\alpha_1$ are estimated from $S(f; s = 3) \sim 1/f^{\alpha_1}$ in the range
of $10^{-7} < f < 10^{-5}\ \mathrm{base}^{-1}$, and high-frequency scaling exponents
$\alpha_2$ are estimated in the range of $10^{-5} < f < 2 \times 10^{-4}\ \mathrm{base}^{-1}$.
The difference between the two scaling exponents, $\Delta\alpha = \alpha_2 - \alpha_1$, is
listed (as a magnitude) in the $\Delta\alpha$ column. Low- and high-frequency exponents for
$S(f)$ with substituted interspersed repeats are indicated by $\alpha'_1$ and $\alpha'_2$, and
their difference by $\Delta\alpha' = \alpha'_2 - \alpha'_1$ (also listed as a magnitude).

Chr   GC%    w (kb)   $\alpha_1$   $\alpha_2$   $\Delta\alpha$   $\alpha'_1$   $\alpha'_2$   $\Delta\alpha'$
1     41.7   1.88     0.88   0.46   0.42   0.80   0.29   0.51
2     40.2   1.86     0.99   0.51   0.48   0.96   0.30   0.66
3     39.7   1.52     0.95   0.43   0.53   0.88   0.27   0.61
4     38.2   1.46     0.87   0.34   0.53   0.75   0.19   0.57
5     39.5   1.38     0.89   0.39   0.51   0.88   0.23   0.65
6     39.6   1.30     0.99   0.36   0.63   0.86   0.24   0.63
7     40.7   1.21     0.97   0.46   0.51   0.87   0.33   0.55
8     40.1   1.12     0.97   0.42   0.55   0.91   0.26   0.66
9     41.3   1.04     0.96   0.39   0.57   0.90   0.28   0.62
10    41.6   1.03     0.97   0.52   0.46   0.95   0.34   0.61
11    41.6   1.03     1.05   0.50   0.55   0.97   0.35   0.62
12    40.8   1.01     0.97   0.39   0.59   0.89   0.28   0.61
13    38.5   0.86     0.83   0.33   0.50   0.73   0.24   0.49
14    40.9   0.80     1.03   0.36   0.66   0.95   0.27   0.68
15    42.2   0.76     0.90   0.50   0.40   0.83   0.39   0.44
16    44.8   0.69     0.91   0.51   0.40   0.81   0.36   0.45
17    45.5   0.62     0.98   0.57   0.42   0.89   0.44   0.46
18    39.8   0.58     1.12   0.40   0.72   1.12   0.28   0.83
19    48.4   0.49     1.00   0.56   0.44   0.81   0.37   0.45
20    44.1   0.49     0.87   0.51   0.36   0.83   0.30   0.53
21    40.9   0.36     0.91   0.33   0.58   0.86   0.22   0.64
22    47.9   0.38     0.90   0.62   0.28   0.86   0.40   0.45
X     39.4   1.17     0.93   0.38   0.54   0.73   0.18   0.55
Y     39.1   0.38     0.83   0.38   0.45   0.70   0.21   0.49

To investigate the effect of interspersed repeats, we substitute them by random bases
according to the chromosomal level of GC%. In what follows, transformed
(repeat-substituted) DNA sequences are distinguished from the original sequences. On the
coarse-grained level, the substitution is equivalent to replacing, in the $\{p_i\}$
($i = 1, 2, \ldots, N$) series, any value calculated from interspersed repeats by a random
value sampled from a Gaussian distribution whose mean and variance are the same as those
of GC% in the original sequence. Another possibility consists in substituting repetitive
sequences by a constant value (e.g., the averaged GC% value of the original sequence).
This method introduces additional correlations (and less variance) in the $\{p_i\}$
series, and is not adopted in this paper.
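
The coarse-grained substitution described above can be illustrated with NumPy. The sketch below assumes that each window carries a boolean flag marking it as repeat-derived; how such a flag would be obtained from the soft-masked sequence is not specified in the text, so `is_repeat`, the function name `substitute_repeats`, and the synthetic input are all hypothetical.

```python
import numpy as np

def substitute_repeats(p, is_repeat, rng):
    """Replace flagged GC% values by Gaussian draws with the series' mean and variance."""
    p = np.asarray(p, dtype=float)
    mask = np.asarray(is_repeat, dtype=bool)
    out = p.copy()
    # The Gaussian uses the mean and standard deviation of the original series.
    out[mask] = rng.normal(p.mean(), p.std(), size=mask.sum())
    return out

rng = np.random.default_rng(0)
p = rng.normal(41.0, 5.0, size=1000)        # synthetic GC% series, not genomic data
is_repeat = rng.random(1000) < 0.45         # hypothetical per-window repeat flags
q = substitute_repeats(p, is_repeat, rng)
```

Replacing the flagged values by a constant instead would shrink the variance of the series, which is the drawback of the alternative noted in the paragraph above.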

Three different, albeit functionally related, measures are applied to the $\{p_i\}$
series: the power spectrum $S(f)$ as a function of the frequency $f$, the correlation
function $\Gamma(d)$ as a function of the distance $d$ between windows, and the variance
$\sigma^2(w)$ of the GC% series as a function of the window size $w$.



First, we conduct spectral analyses by calculating the power spectrum, the squared
modulus of the Fourier transform normalized by the number of windows, defined as:

$$S(f) = \frac{1}{N} \left| \sum_{k=1}^{N} p_k \, e^{-i 2\pi k f / N} \right|^2 \qquad (2)$$

where $N$ is the total number of windows, and $f$ is measured in units of cycle/window,
which can be converted to units of cycle/base by the window size (cf. Table 1).
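
As a rough illustration of Eq. (2), the periodogram can be computed with a fast Fourier transform. The sketch below uses NumPy rather than the S-PLUS routines named in the paper; the function names are hypothetical, and the conversion to cycle/base assumes that the frequency index $f$ counts cycles over the whole series of $N$ windows of $w$ bases each.

```python
import numpy as np

def periodogram(p):
    """S(f) = |sum_k p_k exp(-i 2 pi k f / N)|^2 / N for f = 1, ..., N/2."""
    p = np.asarray(p, dtype=float)
    n = p.size
    # rfft indexes k from 0 instead of 1; this only changes a phase factor,
    # so the squared modulus is the same as in Eq. (2).
    spectrum = np.abs(np.fft.rfft(p)) ** 2 / n
    return spectrum[1:n // 2 + 1]           # drop f = 0 (the mean), keep f = 1 .. N/2

def to_cycles_per_base(n, w):
    """Frequencies for f = 1 .. N/2, expressed in cycle/base for window size w (bases)."""
    return np.arange(1, n // 2 + 1) / (n * w)

# Toy check: a pure cosine with 5 cycles over 64 windows peaks at f = 5.
k = np.arange(64)
S = periodogram(np.cos(2 * np.pi * 5 * k / 64))
```

The raw periodogram is noisy, which is why the text goes on to smooth it before fitting scaling exponents.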

Coarse-graining "hides" base-base correlations at scales smaller than $w$ bases. The
choice of $N = 2^{17}$ windows was made such that it is (i) sufficiently large to cover
small-scale fluctuations, while (ii) at the same time sufficiently small so that the
spectral analysis is computationally feasible. As different chromosomes have different
lengths, an equal number of partitions leads to different window sizes $w$.

The unsmoothed S(f), or periodogram, contains N/2 
independent spectral components. One can filter pe- 
riodograms to obtain a "smoothed" spectrum S(f;s), 
where s is the span-size parameter. Since filtering with 
a relatively large s-value possibly distorts the shape of 
S(f; s) at lower frequency components, different span- 
sizes are applied for different frequency ranges. 
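
One simple way to smooth a periodogram is an equal-weight moving average over `span` neighbouring ordinates, which is what a plain Daniell filter amounts to. The sketch below is an assumption-laden illustration: it treats the span-size $s$ as the number of averaged ordinates, and it does not claim to reproduce the exact filter variant or edge handling of the statistical package used in the study.

```python
import numpy as np

def daniell_smooth(spectrum, span):
    """Equal-weight moving average of `span` (odd) neighbouring periodogram ordinates."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    kernel = np.ones(span) / span
    # mode="valid" avoids edge artefacts; the output is shorter by span - 1 points.
    return np.convolve(np.asarray(spectrum, dtype=float), kernel, mode="valid")

# Toy input: smoothing a linear ramp with span 3 leaves the interior values unchanged.
smoothed = daniell_smooth(np.arange(10.0), 3)
```

Because a wide span averages over many low-frequency ordinates at once, a small span is the safer choice at the low-frequency end, in line with the frequency-dependent spans described above.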

The second measure applied to the $\{p_i\}$ series is the correlation function,
$\Gamma(d)$, which is computed from two truncated series of $\{p_i\}$,
$p' = \{p_k\}$ ($k = 1, 2, \ldots, N - d$) and $p'' = \{p_k\}$ ($k = d + 1, d + 2, \ldots, N$):

$$\Gamma(d) = \frac{\mathrm{Cov}(p', p'')}{\sqrt{\mathrm{Var}(p')\,\mathrm{Var}(p'')}} \qquad (3)$$

where $\mathrm{Cov}(p', p'') = \langle p' p'' \rangle - \langle p' \rangle \langle p'' \rangle$ and
$\mathrm{Var}(p') = \langle p'^2 \rangle - \langle p' \rangle^2$ (or
$\mathrm{Var}(p'') = \langle p''^2 \rangle - \langle p'' \rangle^2$) are the covariance and
variance. Note that the $\Gamma(d)$ defined in Eq. (3) is slightly different from that
defined using a periodic boundary condition.
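
Eq. (3) is the Pearson correlation between the series and a copy of itself shifted by $d$ windows. A minimal NumPy sketch (the function name `correlation` is hypothetical, and $d \geq 1$ is assumed):

```python
import numpy as np

def correlation(p, d):
    """Gamma(d) of Eq. (3) for an integer distance d >= 1 (in windows)."""
    p = np.asarray(p, dtype=float)
    head, tail = p[:-d], p[d:]              # p' (k = 1..N-d) and p'' (k = d+1..N)
    cov = np.mean(head * tail) - head.mean() * tail.mean()
    # np.var uses the population form <x^2> - <x>^2, matching the definitions above.
    return cov / np.sqrt(head.var() * tail.var())

# Toy input: a sine wave with a period of 20 windows.
k = np.arange(2000)
wave = np.sin(2 * np.pi * k / 20)
gamma_full_period = correlation(wave, 20)   # shift by one period
gamma_half_period = correlation(wave, 10)   # shift by half a period
```

For a periodic signal, $\Gamma(d)$ oscillates with the same period, which is the kind of signature examined for chromosome 21 in Sec. VI.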

The third and final measure applied to the $\{p_i\}$ series is the variance $\sigma^2(w)$:

$$\sigma^2(w) = \langle p(w)^2 \rangle - \langle p(w) \rangle^2 \qquad (4)$$

as a function of the window size $w$. The power spectrum, the correlation function, and
the window-size-dependent variance are interrelated quantities [16]:



$$\sigma^2(w) = \frac{\Gamma(0)}{w} \left[ 1 + \frac{2}{w} \sum_{d=1}^{w-1} (w - d)\, \Gamma(d) \right]. \qquad (5)$$



If $S(f) \sim 1/f^\alpha$, $\Gamma(d) \sim 1/d^\gamma$, and $\sigma^2(w) \sim 1/w^\beta$ are
power-law functions, then their scaling exponents are related by
$\alpha = 1 - \gamma$ and $\gamma = \beta$.
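
The window-size dependence of the variance can be explored by repeatedly merging adjacent windows in pairs and recomputing the variance, then fitting a line in log-log coordinates. The sketch below is illustrative only: the function name is hypothetical, averaging two GC% values is assumed to be an adequate stand-in for recomputing GC% on the merged window (exact only if both windows contain the same number of sequenced bases), and the input is synthetic white noise, for which $\beta \approx 1$ is expected.

```python
import numpy as np

def variance_vs_window(p, n_levels):
    """Variance of the series after 0, 1, 2, ... rounds of pairwise window merging."""
    p = np.asarray(p, dtype=float)
    sizes, variances = [], []
    for level in range(n_levels):
        sizes.append(2 ** level)            # window size in units of the original window
        variances.append(p.var())
        p = p[: p.size // 2 * 2].reshape(-1, 2).mean(axis=1)   # merge neighbours
    return np.array(sizes), np.array(variances)

rng = np.random.default_rng(1)
sizes, variances = variance_vs_window(rng.normal(size=2 ** 16), 8)
# sigma^2(w) ~ 1/w^beta, so beta is minus the slope in log-log coordinates.
beta = -np.polyfit(np.log10(sizes), np.log10(variances), 1)[0]
alpha = 1.0 - beta                          # exponent relation alpha = 1 - beta
```

A series with long-ranging correlations would instead give $\beta$ well below 1, i.e., a variance that resists reduction as windows grow.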

The calculation of $S(f)$ and $\Gamma(d)$ was carried out with the statistical package
S-PLUS (Version 3.4, MathSoft, Inc.), and the type of filter implemented for $S(f)$ is
the Daniell filter [?].



III. 1/f NOISE IS A UNIVERSAL FEATURE OF
HUMAN DNA SEQUENCES

In this section, we use the power spectrum $S(f)$ to study GC% of human genome
sequences, with the goals of testing the universality of $1/f$ noise, quantifying
different decay ranges for $S(f) \sim 1/f^\alpha$, and comparing $S(f)$ across DNA
sequences of different human chromosomes.

Figure 3 shows the power spectra $S(f)$ of $N = 2^{17}$ GC% values for all human
chromosomes. We find that $S(f)$ exhibits no clear plateau at low frequency
($< 10^{-6}$ cycle/base) and increases steadily with decreasing frequency. The decay can
be approximated by a power law of the form $S(f) \sim 1/f^\alpha$ with $\alpha \approx 1$.
Table 1 lists the estimated scaling exponent $\alpha_1$ for all chromosomes in the
frequency range $f = 10\ \mathrm{Mb}^{-1}$ to $100\ \mathrm{kb}^{-1}$, using a best-fit
regression of $\log_{10} S(f; s = 3) = a - \alpha_1 \log_{10}(f)$. We find that $\alpha_1$ is
typically close to 1, with little variation across chromosomes.
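
The exponent estimate described above amounts to a straight-line fit of $\log_{10} S$ against $\log_{10} f$ within a chosen frequency range, with $\alpha$ given by minus the slope. A minimal sketch, with a hypothetical function name and a synthetic, exact power law as input (the value 0.88 is used purely as an illustrative target):

```python
import numpy as np

def scaling_exponent(freq, spectrum, f_min, f_max):
    """Least-squares estimate of alpha in S(f) ~ 1/f^alpha for f_min <= f <= f_max."""
    freq = np.asarray(freq, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    keep = (freq >= f_min) & (freq <= f_max)
    slope, _intercept = np.polyfit(np.log10(freq[keep]), np.log10(spectrum[keep]), 1)
    return -slope                            # S ~ f^(-alpha): alpha is minus the slope

f = np.logspace(-7, -4, 200)                 # synthetic frequencies in cycle/base
s = 3.0 * f ** -0.88                         # exact power law, for illustration only
alpha_1 = scaling_exponent(f, s, 1e-7, 1e-5)
```

Running the same fit over a second frequency range gives the high-frequency exponent, so a cross-over shows up as two clearly different slopes.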

A closer inspection of Fig. 3 shows that the majority of $1/f$ spectra undergo a
cross-over from $\alpha_1 \approx 1$ to $\alpha_2 < 1$ at high frequency. The deviation from
$\alpha_1 \approx 1$ starts at length scales of about 30-100 kb and continues at smaller
distances. Figure 4 illustrates this feature in more detail for $S(f; s = 31)$ of the DNA
sequences of Chr15, Chr21, and Chr22. We find that chromosomes 15 and 21 exhibit clear
cross-overs at about 100 kb, while chromosome 22 exhibits no apparent break-point.
Table 1 contains the corresponding scaling exponents $\alpha_2$ for the frequency range
$f = 100\ \mathrm{kb}^{-1}$ to $5\ \mathrm{kb}^{-1}$, obtained from the regression
$\log_{10} S(f; s = 3) = a - \alpha_2 \log_{10}(f)$. We find a pronounced difference in
absolute values between $\alpha_1 \approx 1$ and $\alpha_2 < 1$, indicating a transition from
the universal $1/f^{\alpha_1}$ ($\alpha_1 \approx 1$) spectrum at low frequency to a more
flattened $1/f^{\alpha_2}$ ($\alpha_2 < 1$) spectrum at higher frequency.

Figure 5(a) shows $\alpha_1$ and $\alpha_2$ for all human chromosomes as a function of
chromosome-specific GC%. The majority of human chromosomes have a GC content between 38%
and 43%, whereas chromosomes 16, 17, 19, 20, and 22 have a higher GC% of up to 49%. While
the low-frequency scaling exponent $\alpha_1$ remains approximately independent of GC%,
Fig. 5(a) shows that $\alpha_2$ increases with increasing GC%, giving rise to a positive
correlation between $\alpha_2$ and GC%.

The three chromosomes illustrated in Fig. 4 exhibit different degrees of transition from
the $1/f^{\alpha_1}$ ($\alpha_1 \approx 1$) to the flattened $1/f^{\alpha_2}$ ($\alpha_2 < 1$)
spectrum, with chromosome 21 (22) undergoing the sharpest (smoothest) transition. This
observation can be further quantified by the change in scaling exponents $\alpha_1$ and
$\alpha_2$. Table 1 lists the magnitude of $\Delta\alpha = \alpha_2 - \alpha_1$ for all
chromosomes. Chromosome 22 is distinct from all other human chromosomes as the most
scale-invariant one (same or similar scaling exponent at different length scales). The
same observation, that human chromosome 22 may differ from the remaining human
chromosomes, was made using limited sequence data in [?, 20].



IV. INTERSPERSED REPEATS ARE NOT 
RESPONSIBLE FOR 1/f SPECTRUM 

About 45% of human genomic DNA sequences are interspersed repeats [19]. Interspersed
repeats consist of copies of the same sequence segment that are inserted in the human
genome, remain highly similar to the ancestral sequence, and have been implicated in a
variety of biological functions, including genome organization, human chromosome
segregation, and regulation of gene expression [24]. Large copy numbers increase the
sequence redundancy, and it has been shown, e.g., that about 10% interspersed Alu repeats
significantly increase base-base correlations in the range up to 300 bases.

Figure 6 shows the power spectrum $S(f)$ for the original human chromosome 1 and for the
transformed sequence in which interspersed repeats are substituted. In the low-frequency
range of $10^{-7} < f < 10^{-5}$ cycle/base, we find that $S(f)$ decays in the original
sequence with $\alpha_1 \approx 0.88$ and in the transformed sequence with
$\alpha'_1 \approx 0.80$, indicating only marginal differences in the decay properties of
$S(f)$ due to repetitive sequences. In contrast, in the high-frequency range of
$10^{-5} < f < 2 \times 10^{-4}$ cycle/base, we find $\alpha_2 \approx 0.46$ and
$\alpha'_2 \approx 0.29$. Interspersed repeats thus contribute to the decay of $S(f)$ for
high-frequency components: substituting them flattens the power spectrum.

The scaling exponents $\alpha'_1$ and $\alpha'_2$ for repeat-substituted DNA sequences of all
24 human chromosomes are shown in Table 1. The difference between low- and
high-frequency ranges for DNA sequences of original chromosomes,
$\Delta\alpha = \alpha_2 - \alpha_1$, is generally smaller in magnitude than the difference
between low- and high-frequency ranges for transformed sequences,
$\Delta\alpha' = \alpha'_2 - \alpha'_1$. When we compare $\alpha_1$ and $\alpha'_1$, as well as
$\alpha_2$ and $\alpha'_2$, we find that the magnitude of $\alpha'_1$ ($\alpha'_2$) is generally
smaller than that of $\alpha_1$ ($\alpha_2$), which indicates that substituting interspersed
repeats flattens the spectrum. The average change of low-frequency scaling exponents,
$\alpha_1 - \alpha'_1$, is about 0.07, whereas the average change of high-frequency scaling
exponents, $\alpha_2 - \alpha'_2$, is about 0.14. This confirms that the universal presence
of the $1/f$ spectrum at low frequency is not caused by interspersed repeats, but that
interspersed repeats affect $S(f)$ predominantly at high frequencies. A similar
conclusion, that the decay rate of base-base correlations in DNA sequences of human
chromosomes 20, 21, and 22 is not markedly affected by the substitution of interspersed
repeats, was reached in [6].

We note that the extent of deviation, $|\alpha' - \alpha|$, depends on how the replacement of
interspersed repeats is conducted. Possible substitutions of interspersed repeats include
the substitution by a constant value or by a randomly sampled value. In general, the
substitution of GC% values calculated from the repetitive sequences by random values
enhances the deviation and flattens the spectrum $S(f)$ more than the substitution by a
constant value (e.g., average GC%).

FIG. 3: Double-logarithmic representation of power spectra $S(f)$ of GC% of all
twenty-four human chromosomes. Each plot shows $S(f)$ of six chromosomes (shifted on the
$y$-axis for clearer representation): chromosomes (a) 1-6; (b) 7-12; (c) 13-18; (d)
19-22, X, and Y. The $x$-axis (in logarithmic scale) is converted from cycle/window to
cycle/base by using the window sizes listed in Table 1. $S(f)$ is filtered at different
levels for different frequency ranges: $S(f; s = 1)$ for the first ten spectral
components, $S(f; s = 3)$ for the components 11-30, $S(f; s = 31)$ for the components
31-400, and $S(f; s = 501)$ for the components 400-65536.

FIG. 4: Cross-over from $S(f) \sim 1/f^{\alpha_1}$ to $S(f) \sim 1/f^{\alpha_2}$ illustrated
for human chromosomes 15, 21, and 22 (smoothed with a span size of 31 and shown on a
double-logarithmic scale). The scaling exponents $\alpha_1$ and $\alpha_2$ are shown for the
frequency ranges $10\ \mathrm{Mb}^{-1}$ to $100\ \mathrm{kb}^{-1}$ and
$100\ \mathrm{kb}^{-1}$ to $5\ \mathrm{kb}^{-1}$.



V. RESISTANCE TO VARIANCE REDUCTION
AT LARGER WINDOW SIZES

FIG. 5: (a) Scaling exponents $\alpha_1$ and $\alpha_2$ obtained by fitting the power
spectrum $S(f) \sim 1/f^{\alpha_i}$ ($i = 1, 2$) in the frequency ranges
$10\ \mathrm{Mb}^{-1}$ to $100\ \mathrm{kb}^{-1}$ and $100\ \mathrm{kb}^{-1}$ to
$5\ \mathrm{kb}^{-1}$, respectively, versus the chromosome-specific GC content of all 24
human chromosomes. (b) Scaling exponents $\alpha'_1$ and $\alpha'_2$ for $S(f)$ with
substituted interspersed repeats.

FIG. 6: Power spectra $S(f)$ of GC% for the original and the transformed (interspersed
repeats substituted) DNA sequence of human chromosome 1. The scaling exponents for the
low-frequency (10 Mb-100 kb) and high-frequency (100 kb-5 kb) ranges are obtained by a
best-fit regression of $\log_{10} S(f)$ over $\log_{10} f$.

In this section, we study the decay properties of the variance ($\sigma^2$) of spatial GC%
series as a function of different window sizes $w$, and we compare the scaling of
$\sigma^2$ with the scaling of the power spectrum $S(f)$.

Early experimental measurements of the GC% distribution using cesium chloride (CsCl)
profiles [25] showed for mouse (Mus musculus) genomic DNA sequences that the variance of
GC% values does not markedly decrease with the DNA segment size [26]. This experimental
observation is directly related to the presence of $1/f$ spectra in DNA sequences [?].
If the variance of the spatial GC% series calculated at the window size $w$ is
$\sigma^2(w)$, then a scaling of $\sigma^2(w) \sim 1/w^\beta$ implies a corresponding scaling
in the power spectrum, $S(f) \sim 1/f^{1-\beta}$ [14, 28]. If GC% is obtained from $w$
uncorrelated bases, it follows a binomial distribution. Consequently,
$\sigma^2(w) = \langle p \rangle (1 - \langle p \rangle)/w \sim 1/w$ with $\beta = 1$. The
corresponding scaling exponent of the power spectrum is $\alpha = 1 - \beta = 0$, and thus
$S(f) \sim \mathrm{const.}$, which is equivalent to white noise.
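
The white-noise baseline can be checked numerically: for independently drawn bases, the variance of the GC fraction in windows of $w$ bases should follow the binomial result $\langle p \rangle (1 - \langle p \rangle)/w$. The sketch below works with GC fractions (0 to 1) rather than percentages; the GC fraction of 0.41 and the function name are illustrative assumptions, and the sequence is simulated, not genomic.

```python
import numpy as np

def window_variance(is_gc, w):
    """Variance of the GC fraction over non-overlapping windows of w bases."""
    usable = is_gc[: is_gc.size // w * w]           # drop an incomplete last window
    return usable.reshape(-1, w).mean(axis=1).var()

rng = np.random.default_rng(2)
p_gc = 0.41                                         # illustrative GC fraction, not a fitted value
is_gc = rng.random(2 ** 20) < p_gc                  # independent bases: True = G or C

for w in (16, 64, 256):
    expected = p_gc * (1 - p_gc) / w                # binomial variance of a proportion
    print(w, window_variance(is_gc, w), expected)
```

Each fourfold increase in $w$ should reduce the simulated variance by roughly a factor of four, which is the $\beta = 1$ reference line against which the genomic data are compared.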

Figure 7 shows $\sigma^2(w)$ as a function of window size $w$ for all human chromosomes.
In a double-logarithmic representation, we find that $\log(\sigma^2(w))$ decays
approximately linearly with $\log(w)$. A decay according to $\sigma^2(w) \sim 1/w^\beta$
with $\beta = 1$ corresponds to white noise. This situation is indicated in Fig. 7 by the
straight line. An inspection of Fig. 7 shows, however, that the variance decays at a much
slower rate than it would for white noise. The variance of the DNA sequence of human
chromosome 1, e.g., gives rise to $\beta \approx 0.12$, and the corresponding scaling
exponent $\alpha \approx 1 - \beta = 0.88$ is indeed close to the estimated exponent listed
in Table 1. The scaling of the variance with the exponent $\beta \ll 1$ is in accord with
the low-frequency $1/f$ noise.

The scaling of $\sigma^2(w)$ shown in Fig. 7 differs from one human chromosome to another.
For instance, in the range of $w = 1$ kb-5 Mb, human chromosome 13 exhibits a clear
transition from $\beta_2 \approx 0.27$ ($w < 50$ kb) to $\beta_1 \approx 0.10$ ($w > 50$ kb),
corresponding to $S(f) \sim 1/f^{0.63}$ and $S(f) \sim 1/f^{0.9}$, respectively, at high-
and low-frequency ranges. Other human chromosomes, although generally exhibiting a
power-law scaling form of $\sigma^2(w)$, show deviations from the
$\sigma^2(w) \sim 1/w^\beta$ line for the largest window sizes tested.

FIG. 7: Double-logarithmic representation of the variance $\sigma^2(w)$ of the spatial GC%
series for all human chromosomes (Chr) as a function of the window size $w$: (a) ○ Chr1,
△ Chr2, + Chr3, × Chr4, ◇ Chr5, ▽ Chr6; (b) ○ Chr7, △ Chr8, + Chr9, × Chr10, ◇ Chr11,
▽ Chr12; (c) ○ Chr13, △ Chr14, + Chr15, × Chr16, ◇ Chr17, ▽ Chr18; (d) ○ Chr19, △ Chr20,
+ Chr21, × Chr22, ◇ ChrX, ▽ ChrY. Straight lines indicate $\sigma^2(w) \sim 1/w$
(corresponding to white noise). One regression line for Chr1 ($\beta \approx 0.12$) and a
piece-wise regression for Chr13 ($\beta \approx 0.27$ and $\beta \approx 0.10$) are drawn. The
95% confidence interval for the $\sigma^2(w)$ estimation of Chr1 at each point of $w$ is
marked by a vertical dashed line.

The investigation of $\sigma^2(w)$ as a function of different window sizes $w$ requires
careful examination [29, 30].

First, since we partition each human chromosome into $2^k$ ($k = 17, 16, \ldots$) windows,
the variance of the GC% series $\{p_i\}$ could be accidentally large when windows reside
on the isochore borders, and small by chance if they start/end within an isochore.

Second, when the number of windows is small (e.g., the last point of $\sigma^2(w)$ for each
chromosome in Fig. 7 is calculated with the largest window size that gives rise to 32
windows), the standard error of the sample variance is large. The 95% confidence interval
for $\sigma^2(w)$ of Chr1 is shown in Fig. 7(a), using the interval
$[(w-1)\sigma^2/t_{0.025},\ (w-1)\sigma^2/t_{0.975}]$, where $t_x$ is defined by
$\int_0^{t_x} \chi^2(\mathrm{df} = w - 1)\, dt = x$ (where $\chi^2(\mathrm{df})$ is the
chi-square distribution with df degrees of freedom) [?]. Figure 7(a) shows that for fewer
windows (and larger window sizes), the 95% confidence interval of $\sigma^2(w)$ could be so
large that the estimated value of $\beta$ may change from sample to sample.

Finally, the relationship between scaling exponents, $\alpha + \beta = 1$ [14, 28], is based
on the assumption that both $S(f)$ and $\sigma^2(w)$ are theoretical power-law functions.
If $S(f)$ is a piece-wise power-law function, as in the case of GC% fluctuations of human
chromosomes, a correction term to the relationship $\alpha + \beta = 1$ is expected.

FIG. 8: Correlation function $\Gamma(d)$ for 24 human chromosomes (Chr) as a function of
the window distance $d$ (converted to bases by the window size listed in Table 1). The
distance is represented on a logarithmic scale. (a) Chr1-6; (b) Chr7-12; (c) Chr13-18;
and (d) Chr19-22, ChrX, and ChrY.

FIG. 9: Correlation function $\Gamma(d)$ for human chromosome 21 as a function of the
window distance $d$ (converted to bases by the window size given in Table 1). The
oscillation in $\Gamma(d)$ is highlighted by vertical lines, indicating the distances of
$d = 500$ kb, 1 Mb, 1.5 Mb, and 2 Mb.



VI. CHROMOSOME-SPECIFIC CORRELATION 
STRUCTURES 

Apparently, $1/f$ noise in music and speech signals [12] does not prevent music and
speech from sounding different. Similarly, universal $1/f^\alpha$ spectra in GC%
fluctuations across human chromosomes do not imply that all chromosomes exhibit the same
detailed correlation structure. The generic trend of $S(f)$ spectra to increase at low
frequency may "co-exist" with small peaks at higher frequency. Such chromosome-specific
characteristic length scales can be more intuitively examined by correlation functions.
In this section, we investigate the correlation function $\Gamma(d)$ of coarse-grained DNA
sequences of human chromosomes with the aim of further examining chromosome-specific
structures, such as characteristic length scales and oscillations detected by $\Gamma(d)$.

Figure 8 shows, for all human chromosomes, the correlation functions $\Gamma(d)$ of the
GC% series $\{p_i\}$ calculated for the window sizes given in Table 1. For each
chromosome, the minimum (maximum) distance is 80 (16,000) windows. Since each chromosome
is partitioned into $2^{17}$ windows, the maximum distance $d$ at which the correlation is
examined is about $16{,}000/2^{17} \approx 12\%$ of the total sequence length.

An inspection of Fig. 8 shows that the magnitude of correlation at the distance of
$d = 1$ Mb is clearly above the noise level. With the exceptions of Chr15, Chr22, and
ChrY, the correlation function satisfies $\Gamma(d) > 0.1$ at $d = 1$ Mb for all
chromosomes. The low correlation in ChrY is due to the fact that about half of the bases
are unsequenced, and the substitution of gaps by random values lowers the correlation. At
even longer distances such as $d = 10$ Mb, the correlations $\Gamma(d = 10\ \mathrm{Mb})$
for chromosomes 1 and 6 are still above the 0.1 level.

Given the different window sizes ($w$) due to different chromosome sizes, and provided
that the covariance of GC% is approximately independent of $w$, a scaling of the variance
according to $1/w^\beta$ implies that the correlation function $\Gamma(d)$ in Eq. (3)
increases with the window size as $\sim w^\beta$. Test calculations of the covariance for
$2^{15}$ and $2^{17}$ windows show that the covariance differs by less than 1% (and hence
is fairly independent of $w$ in this range of window sizes). Consequently, for a detailed
comparison of correlation functions calculated for different chromosomes, one has to take
into account the different window sizes.

Any deviation from the monotonic decrease of $\Gamma(d)$ might be indicative of
correlations at characteristic length scales (visible as "bumps"). For example, Fig. 8
shows such a bump for chromosome 1 at $d \approx 21$-23 Mb. Bumps or sharper peaks in other
chromosomes include $d \approx 9.3$ Mb (Chr2), 7.2 Mb (Chr10), 3.2-3.8 Mb (Chr12), and
2.4-3.1 Mb (Chr19). One plausible explanation is that for chromosomes 2, 10, 12, and 19,
one or a few alternations of GC-rich/GC-poor isochores [?] with these length scales
enhance the correlation.

Chromosome 21 stands out among all human chromosomes for having comparatively higher
correlations at distances of several Mb (despite having a smaller $w^\beta$ factor than
other chromosomes due to a smaller window size). A detailed inspection of Fig. 9 uncovers
an oscillation of $\Gamma(d)$ with a period of about 500 kb, ranging from $d = 500$ kb to
$d = 2$ Mb, which has not been reported before. It can be further shown that this
oscillation is not due to the substitution of interspersed repeats [33], and that it is
localized to about one-eighth of the right distal end of chromosome 21 [?].



VII. DISCUSSIONS 

We study correlation structures and spectral components in the set of human chromosomes,
using power spectra, coarse-grained correlation functions, and the variance at different
window sizes. All three measures are interrelated and highlight compositional structures
at different feature levels. Our results firmly establish the presence of long-ranging
correlations and $1/f^\alpha$ spectra in the DNA sequences of the set of twenty-four human
chromosomes.

Using updated and completed human sequence data, we
find the presence of $1/f$ noise in the DNA sequences of
all human chromosomes. We further find that, with the
exception of chromosome 22, all chromosomes exhibit a
cross-over from $1/f^{\alpha_1}$ scaling at low frequencies
to $1/f^{\alpha_2}$ scaling at high frequencies
($\alpha_1 > \alpha_2$). The finding of two scaling ranges
at low and high frequencies is in accord with previous
findings obtained from sequence data of lower quality,
and it refines the break-point regions for each
individual chromosome.
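
As an illustration of how two scaling exponents can be read
off a power spectrum, the following Python sketch computes
a periodogram of a windowed GC% series and fits a straight
line to $\log S(f)$ versus $\log f$ separately below and
above a break frequency. The function names and the choice
of break frequency are assumptions made for this example;
it is not the fitting procedure used for the results
reported here.

```python
import numpy as np

def power_spectrum(gc):
    """Periodogram of a mean-subtracted series (frequencies in cycles per window)."""
    x = np.asarray(gc, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x))
    return freqs[1:], power[1:]  # drop the zero-frequency term

def fit_exponent(freqs, power, f_lo, f_hi):
    """Least-squares slope of log S versus log f on [f_lo, f_hi); returns alpha."""
    sel = (freqs >= f_lo) & (freqs < f_hi)
    slope, _ = np.polyfit(np.log(freqs[sel]), np.log(power[sel]), 1)
    return -slope  # S(f) ~ 1/f^alpha, so alpha is minus the log-log slope
```

Given a break frequency `f_break`, calling
`fit_exponent(freqs, power, 0.0, f_break)` and
`fit_exponent(freqs, power, f_break, 1.0)` would give
estimates of $\alpha_1$ and $\alpha_2$, respectively. Raw
periodograms are noisy, so some averaging of the spectrum
is usually advisable before fitting.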

We also examined the effect of interspersed repeats,
which make up about 45% of the human genome. Using a
procedure that masks interspersed repeats and
subsequently substitutes them with random GC% values, we
find that interspersed repeats (i) only marginally affect
the scaling exponent $\alpha_1$ in the low-frequency range,
but (ii) lower $\alpha_2$ in the high-frequency range (cf.
Fig. 5(b)). This supports the general understanding that
interspersed repeats only contribute to short-range
(high-frequency) correlations.
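
A substitution step of this kind can be sketched in Python
as follows. The sketch makes two simplifying assumptions
that are not taken from the procedure used here: masking is
applied to whole windows rather than to individual bases,
and the "random GC% values" are drawn with replacement from
the unmasked windows. The function name `substitute_masked`
is hypothetical.

```python
import numpy as np

def substitute_masked(gc, masked, rng):
    """Replace GC% values of masked windows by values resampled from unmasked windows."""
    gc = np.asarray(gc, dtype=float)
    masked = np.asarray(masked, dtype=bool)
    out = gc.copy()
    # Assumption: random substitutes are drawn with replacement from unmasked windows.
    out[masked] = rng.choice(gc[~masked], size=int(masked.sum()), replace=True)
    return out
```

For example, `substitute_masked(gc, masked,
np.random.default_rng(0))` returns a copy in which only the
masked entries have changed, so spectra computed before and
after substitution can be compared directly.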

We have shown elsewhere that $1/f^{\alpha}$ spectra of GC%
fluctuations are also universally present in the genomic
DNA sequences of the mouse (Mus musculus) [34]. It is
known that human and mouse genomes are separated by
approximately 65-75 million years of evolution. Besides
the similarity (or homology) between these two genomes on
a local scale, there is in fact a large amount of
reshuffling of chromosome segments at a global scale when
present-day copies of the two genomes are compared side
by side [35]. Since reshuffling of a sequence at global
scales could potentially destroy long-range correlations,
it remains to be resolved under what conditions a
reshuffling of the human genome into the mouse genome, or
vice versa, conserves $1/f$ noise.

One possible explanation of why $1/f^{\alpha}$ spectra
appear in both the human and the mouse genomes is that
such long-range patterns were generated from ancestral
DNA sequences by sequence evolutionary mechanisms. One
sequence evolution model, termed the
expansion-modification (EM) model, is known to generate
$1/f^{\alpha}$ spectra [7]. The EM model incorporates
duplications and mutations. Since the duplication process
is an essential element in evolutionary genomics [36],
whose role is perhaps as important as Darwin's natural
selection [37], even a still unsophisticated
incorporation of duplications in the EM model may capture
the essence of the evolutionary origin of long-range
correlations in DNA sequences. In the EM model, only the
duplication of segments with the same length scale is
included, whereas in reality segments with a broad range
of length scales are duplicated.
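
To make the idea of duplication plus mutation concrete, the
following Python sketch iterates a simple binary
expansion-modification rule. The specific update rule
(each symbol is flipped with probability `p` and duplicated
otherwise), the two-symbol alphabet, and all function and
parameter names are assumptions chosen for illustration;
they may differ from the formulation in the cited
reference.

```python
import random

def em_step(seq, p, rng):
    """One update: each symbol is flipped (prob. p) or duplicated (prob. 1 - p)."""
    out = []
    for s in seq:
        if rng.random() < p:
            out.append(1 - s)    # modification: mutate 0 <-> 1
        else:
            out.extend((s, s))   # expansion: duplicate the symbol
    return out

def em_sequence(n_steps, p, seed=0, start=(0,)):
    """Iterate em_step n_steps times from a short seed sequence."""
    rng = random.Random(seed)
    seq = list(start)
    for _ in range(n_steps):
        seq = em_step(seq, p, rng)
    return seq
```

In this sketch only single symbols are duplicated at each
step, which mirrors the limitation noted above that
duplications occur at a single length scale.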

One frequently posed question concerns the "biological
meaning" of $1/f^{\alpha}$ spectra or long-range
correlations in DNA sequences. In order to address this
question, one may first ask a couple of related
questions. Does the GC% composition have any biological
effect? Which biological functions of the DNA molecule
are of relevance? From the functional genomics
perspective, interesting biological processes related to
DNA molecules include transcription, replication, and
recombination, and their potential connection to GC% has
been reviewed in [27, 38, 39]. Generally speaking, GC% has
a statistical association with all three processes,
though the cause-and-effect relationship has not yet been
firmly established. Recent studies show that broadly
expressed "housekeeping genes" tend to be located in
GC-rich regions [40]. To understand the genome-wide
organization of biological units that play a role in
those processes (e.g., genes, origins and timing of
replication, or recombination hotspots), at times it is
more feasible to directly study the spatial distribution
of functional units instead of using the GC% as a
surrogate.

From the biophysics and cellular biology perspective,
GC% is linked with bands from chromosome staining [41],
and in addition, possibly with the matrix/scaffold
attachment/associated regions located at the ends of DNA
loops [42]. It has also been suggested that GC-rich
chromosomes (or regions) tend to be located in the
interior of the nucleus during interphase and are more
"open" in their tertiary structure, whereas GC-poor
segments are more likely to be close to the surface of
the nucleus and more condensed [43].

Further exploration of the relationship between GC%
fluctuations, as well as their large-scale patterns, and
the above biological processes is beyond the scope of
this paper. An attempt has been made for bacterial
genomes to relate the scale-invariance feature in
sequence statistics to the genome organization of
transcription activities [29]. It is clear that more
integrated computational and experimental analyses need
to be carried out along similar lines before one can give
the universal $1/f$ spectra in DNA sequences a
satisfactory biological explanation.